In case you still have stuff in your environment from the previous exercises, this is a reminder to clear your environment with rm(list = ls()).
For the next steps, I would like to switch to the tidyverse. The tidyverse is an R package. In fact, it is actually a collection of various packages that come with many handy functions for data wrangling and visualisation. It is perfectly normal to use various R packages for additional functionality and you will “collect” a lot of them over time. However, the packages of the tidyverse are a little bit special. The tidyverse has its own “coding style” which works slightly differently from what we’ve learned so far. This has led to a whole war between “base R” and “the tidyverse”, with many hardliners on both sides. Sometimes, the tidyverse is described as being especially beginner friendly and, well … tidy. Personally, I don’t know what all the fuss is about: Both base R and the tidyverse have their pros and cons and honestly, I think they work best when used together. I use whatever I think works best for the task at hand - it’s not “either or”. In this course, I have to make decisions about what to teach you, and I will mainly show you the tidyverse way of doing things, even though I’m a huge fan of base R. But I can either teach you to do one thing in 50 different ways, or teach you how to do 50 different things instead, so I’m going with the latter. If you are interested, I can share materials with you that teach you how to do things “the base way”.
Let’s install the tidyverse together. You can install any package using the function install.packages(). It takes the package name as an argument. If you start typing the package name without the quotation marks, R will start suggesting packages to you which are directly available on CRAN (this is also where we installed R from).
Just uncomment the lines below - the code is commented out so it isn’t run every time you run this R Markdown file.
# install the tidyverse
# install.packages("tidyverse")
However, just because you have a package, that does not mean you can use it. Packages only become “active” if you load them with the function library(). This seems a bit effortful, but there are good reasons for it: Because there are so many packages, a lot of them use the same names for different functions. One function in one package can do an entirely different thing than the “same” function in a different package! So, if all packages were loaded, it might be a little bit difficult to avoid confusion. Usually, you want to make a conscious decision about which packages you want to load in your script. Once a package is loaded, it stays loaded throughout the entire R session. Let’s load the tidyverse.¹
# load the tidyverse
library(tidyverse)
One of the most popular and noticeable features of the tidyverse is the pipe operator, which looks like this: %>%.² It is used to create chains of functions, which means that you put functions after each other, linking them with the pipe operator. In base R, you achieve the same thing with a more “onion-like approach”. Let me show you what I mean by this. Say we have this vector:
example_vector <- c(1, 3, 33, 50, 5, 1, 2)
We want to take the mean of this vector (using the mean() function), and then round the result to two decimal places (using the function round()). In base R, we would nest the functions inside each other.³ They are then executed from the inside towards the outside.
# take the mean of the example vector and round the result to two decimal places
# (base R style)
round(mean(example_vector), 2)
## [1] 13.57
We could also do the same thing stepwise (and sometimes this is the best approach if you want your code to be readable).
mean_example_vector <- mean(example_vector)
round(mean_example_vector, 2)
## [1] 13.57
In the tidyverse, we chain those functions together using the pipe operator:
# this time: tidy style!
mean(example_vector) %>% round(2)
## [1] 13.57
example_vector %>% mean() %>% round(2)
## [1] 13.57
Note how we can leave out the actual data argument every time. mean() doesn’t seem to get any argument at all. That is because the pipe “delivers” example_vector to the mean() function. In other words, mean() does get an argument, only it’s not coming from within the brackets, but from the pipe to its left. We see that round() still has the number 2 as its second argument, indicating that we want to round to two decimal places. However, the first argument has been omitted. That’s again because the data - the mean of example_vector in this case - is coming from the left. Let’s practice chaining some functions together.
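Before the exercises, here is a minimal sketch that checks the claim above: the pipe hands its left-hand side to the next function as the first argument, so the nested and the piped version should give the same result. The names nested and piped are just made up for this illustration, and I load magrittr (the package the %>% pipe comes from) instead of the whole tidyverse to keep the sketch small.

```r
library(magrittr)  # provides %>%; it is also available after library(tidyverse)

example_vector <- c(1, 3, 33, 50, 5, 1, 2)

# The pipe passes whatever is on its left as the FIRST argument of the call
# on its right, so these two lines describe the same computation.
nested <- round(mean(example_vector), 2)
piped  <- example_vector %>% mean() %>% round(2)

identical(nested, piped)
```

If identical() returns TRUE, both styles really are interchangeable for this case.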
1.1) Turn this “function onion” into a tidyverse chain:
(Can you find out what the function abs() does?)
another_vector <- c(3, -2, 5, 99, -132.5)
# use abs on the vector, then take the mean and transform the result into a character
as.character(mean(abs(another_vector)))
## [1] "48.3"
another_vector %>%
  abs() %>%
  mean() %>%
  as.character()
## [1] "48.3"
1.2) Let’s make things a bit trickier with additional arguments.
vector_with_NAs <- c(3, -2, 5, NA, 99, -132.5, NA)
# use abs on the vector, show the first 4 elements with head,
# take the mean (with na.rm = TRUE) and turn into character
as.character(mean(head(abs(vector_with_NAs), 4), na.rm = TRUE))
## [1] "3.33333333333333"
vector_with_NAs %>%
  abs() %>%
  head(4) %>%
  mean(na.rm = TRUE) %>%
  as.character()
## [1] "3.33333333333333"
1.3) How do we get this into a pipe?
mean(another_vector * 2)
## [1] -11
This does not work:
another_vector * 2 %>% mean()
## [1]    6   -4   10  198 -265
We have to use parentheses to keep another_vector and *2 together:
(another_vector * 2) %>% mean()
## [1] -11
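Why does the version without parentheses fail? As far as I can tell, it comes down to operator precedence: %>% binds more tightly than *, so R pipes only the 2 into mean() and multiplies afterwards. Here is a small sketch that checks this reading; the names unparenthesised, spelled_out and parenthesised are made up for the illustration.

```r
library(magrittr)  # provides %>%

another_vector <- c(3, -2, 5, 99, -132.5)

# Without parentheses, R reads this as another_vector * (2 %>% mean()),
# i.e. another_vector * mean(2), which is just another_vector * 2.
unparenthesised <- another_vector * 2 %>% mean()
spelled_out     <- another_vector * mean(2)
identical(unparenthesised, spelled_out)

# With parentheses, the multiplication happens first and the product is piped.
parenthesised <- (another_vector * 2) %>% mean()
parenthesised
```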
The pipes are all fun and games for vectors, but they unlock their true potential when applied to data frames. Furthermore, the tidyverse offers a lot of functionalities for “data wrangling”, i.e. cleaning, formatting and processing data. Let’s load in the nerd data again.
nerd <- read.csv("./data_sets/nerd_data_short.csv", sep = "\t")
The data contains a lot of information we don’t want to look at for now. How about selecting only the columns we want to work with right now? This is what the select() function does. We pipe our nerd data into it and then select the columns we want to keep. I want to keep the participants’ age, the gender, whether they’re married or not, and all of the “nerdy” questions, i.e. Q1 - Q26.
nerd %>%
  select(age, gender, married, Q1:Q26)
nerd %>%
  select(age, gender, married, starts_with("Q"))
# the base R way
nerd[c("age", "gender", "married", paste0("Q", 1:26))]
A few remarkable things just happened. First of all, look at how I’m addressing the columns: I’m just using their names, without writing the name of the data frame again. That is, I’m writing age instead of nerd$age or nerd["age"]. That is not self-evident at all, and it is a feature of the tidyverse’s “tidy evaluation”. (Strictly speaking, select() uses a flavour called “tidy selection”, while functions like filter() and mutate() use the closely related “data masking”.) It means you can refer to the columns of your data frame by their bare names, almost as if they were variables in your environment. You don’t have to refer to the data source (nerd, in this case) in order to access the columns. The tidyverse “knows” what you are referring to because you piped the main data source into the select() function.
And then there’s the fact that we don’t have to type out all of the columns Q1 - Q26 - we can actually just write exactly that: Q1:Q26 means column Q1 to Q26. Also note how the columns are now displayed in the order we typed them. Had we chosen a different order, our data would have been ordered differently as well.
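To see these points in isolation, here is a small sketch on a made-up data frame. The data frame toy and its values are invented for this illustration and have nothing to do with the real nerd data; I only assume that dplyr (one of the tidyverse packages) is installed.

```r
library(dplyr)

# Hypothetical toy data, only loosely modelled on the nerd data.
toy <- data.frame(
  age    = c(21, 34, 19),
  gender = c(2, 1, 3),
  Q1     = c(5, 3, 4),
  Q2     = c(1, 2, 2),
  Q3     = c(4, 4, 5)
)

# Bare column names work because select() looks them up inside `toy`.
# Q1:Q3 picks every column from Q1 to Q3, as they appear in the data frame.
picked <- toy %>% select(gender, Q1:Q3)
names(picked)

# The result follows the order in which the columns are named in select().
reordered <- toy %>% select(Q3, age)
names(reordered)
```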
Sometimes, you want to bring one column to the front, but you don’t really want to change anything else. You don’t want to name all the other columns, so there’s a shortcut for that: everything(). It literally throws in all the remaining columns you haven’t selected yet and is great for reordering. Say that, for some reason, we wanted the column voted (whether or not someone voted in the past election) to go first, then the age, and then all the rest. The corresponding code would look like this:
nerd %>%
  select(voted, age, everything())
Don’t forget the () behind everything!
2.1) I want to continue working with a reduced version of our data that only contains the columns age, gender, married, voted, nerdy and also Q1 - Q26. Can you select these columns and save this reduced data frame in a variable called nerd_red (“nerd reduced”)?
nerd_red <-
  nerd %>%
  select(age, gender, married, voted, nerdy, Q1:Q26)
There are further options for selecting subsets of data. We just made a selection based on columns, but we can also select only certain rows. This can be achieved with the filter() function. Let’s for example keep only the female participants, who are coded as 2 in the gender column.
nerd_red %>%
  filter(gender == 2)
# the base R way
nerd_red[nerd_red$gender == 2, ]
We use the same notation as for logical comparisons, so make sure to use == instead of =! Here, we filter for women below the age of 30.
# filter for women below the age of 30 - use &
nerd_red %>%
  filter(gender == 2 & age < 30)
Interestingly, the logical & can be replaced with a comma in filter().
# filter for women below the age of 30 - use ,
nerd_red %>%
  filter(gender == 2, age < 30)
Filter nerd_red. Feel free to look at the code you wrote earlier in the course for the logical comparisons. Choose …
2.2) … anyone who is either 22 or 26 years old. There are several ways to achieve this!
nerd_red %>%
  filter(age == 22 | age == 26)
Another way is to use %in% (at first glance, it’s easily confused with the pipe). %in% tells you for each element on the left whether it also exists on the right. Check out this code to see how it works:
c("tea", "biscuits", "cake") %in% "tea"
## [1]  TRUE FALSE FALSE
"tea" %in% c("tea", "biscuits", "cake")
## [1] TRUE
For our previous exercise, we can use it like this:
nerd_red %>%
  filter(age %in% c(22, 26))
2.3) … anyone who is either a man (gender == 1) or older than 30.
nerd_red %>%
  filter(gender == 1 | age > 30)
2.4) … anyone who is non-binary (gender == 3) and who is currently married or has previously been (1 = Never married, 2 = Currently married, 3 = Previously married).
nerd_red %>%
  filter(gender == 3 & married != 1)
Or:
nerd_red %>%
  filter(gender == 3, married %in% 2:3)
2.5) The column nerdy contains the agreement to the statement “I see myself as nerdy” (1 = disagree strongly - 7 = agree strongly). How many women have the highest self-reported nerdiness? How many have the lowest self-reported nerdiness? What about the men? Solve this task using filters.
# Women with the highest nerd rating
nerd_red %>%
  filter(gender == 2, nerdy == 7) %>%
  nrow()
## [1] 135
# Women with the lowest nerd rating
nerd_red %>%
  filter(gender == 2, nerdy == 1) %>%
  nrow()
## [1] 12
By the way, achieving the same thing using table() is much easier. Before, we’ve used table() on single vectors, like this:
table(nerd_red$nerdy)
## 
##   0   1   2   3   4   5   6   7 
##   1  20  31  36 126 234 311 241
But we can also use it on several vectors together (even more than two - try it out!). With this code, we get an overview of how many people of each gender chose each nerdiness category (rows = nerdy, columns = gender):
table(nerd_red$nerdy, nerd_red$gender)
##    
##       1   2   3
##   0   1   0   0
##   1   8  12   0
##   2  18  12   1
##   3  14  21   1
##   4  40  77   9
##   5  82 141  11
##   6 100 196  15
##   7  87 135  19
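If you only need a single cell of such a cross table, you can index into it like into a matrix. Here is a minimal sketch on made-up vectors; rating, group and counts are hypothetical names for this illustration and the numbers are not taken from the nerd data.

```r
# Hypothetical toy vectors, not taken from the nerd data.
rating <- c(7, 1, 7, 5, 7, 1)
group  <- c(2, 2, 1, 2, 2, 1)

# The first argument ends up in the rows, the second in the columns.
counts <- table(rating, group)
counts

# Rows and columns are labelled with the observed values, so a single cell
# can be looked up by these labels (written as character strings).
counts["7", "2"]  # how often rating 7 occurs within group 2
counts["1", "2"]  # how often rating 1 occurs within group 2
```

This gives the same kind of answer as the filter() plus nrow() approach from exercise 2.5, just read off the table.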
When looking at the data summary, we already noticed that some of the reported values for age are impossible. We want to filter those out - what would you consider to be a good cut-off value?⁴
Also, marital status is supposed to be coded on a scale from 1 - 3, but there are five people where the entry is 0. We want to exclude these guys as well.
# table the married column
table(nerd_red$married)
## 
##   0   1   2   3 
##   5 841 115  39
nerd_red %>%
  group_by(married) %>%
  count()
summary(nerd_red$age)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    13.00    17.00    20.00    63.29    27.00 38822.00
Now we filter out the people with implausible values and overwrite nerd_red:
nerd_red <-
  nerd_red %>%
  filter(married %in% 1:3,
         between(age, 0, 100))
Note that there is no output when I assign something to a variable, so I would have to call the variable nerd_red again in order to display it.⁵
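The between() helper used in the filter above comes from dplyr. As a quick sketch of what it does, here it is on a made-up vector of ages (ages, keep_between and keep_manual are hypothetical names for this illustration):

```r
library(dplyr)

# Hypothetical ages, including two implausible values.
ages <- c(13, 27, 100, 38822, -1)

# between(x, left, right) is shorthand for x >= left & x <= right,
# so both boundaries are included.
keep_between <- between(ages, 0, 100)
keep_manual  <- ages >= 0 & ages <= 100

identical(keep_between, keep_manual)
ages[keep_between]
```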
2.6) After filtering, what is the maximum age?
max(nerd_red$age)
## [1] 38822
2.7) How many people of each gender are included in the data?
table(nerd_red$gender)
## 
##   1   2   3 
## 350 594  56
2.8) How many people per marital status do we have?
table(nerd_red$married)
## 
##   0   1   2   3 
##   5 841 115  39
2.9) How many people did we exclude in total? What could be a clever way to solve this?
nrow(nerd) - nrow(nerd_red)
## [1] 0
We just check how many rows are missing in nerd_red as compared to the original data, nerd!
2.10) Check whether the nerdy column has only valid values, i.e. values from 1 - 7. What would your strategy be? Filter out invalid values if necessary.
table(nerd_red$nerdy)
## 
##   0   1   2   3   4   5   6   7 
##   1  20  31  36 126 234 311 241
nerd_red <-
  nerd_red %>%
  filter(nerdy %in% 1:7)
2.11) Check whether the voted column has only valid entries (1 or 2). Filter out invalid values if necessary.
table(nerd_red$voted)
## 
##   0   1   2 
##   5 250 744
nerd_red <-
  nerd_red %>%
  filter(voted %in% 1:2)
We know how to extract a vector from a data frame.
3.1) Quick recap: Save the age column into a single vector.
age <- nerd$age
But what if we modified our data frame with a pipe and want to quickly extract one column? Sure, we could save the modified data frame and then extract the column.
3.2) Save a data frame where you filter all participants who are younger than 20. Then extract the age column.
nerd_young <- nerd_red %>%
  filter(age < 20)
age_young <- nerd_young$age
However, we can do all of this in one go using the pull() function.
nerd_red %>%
  filter(age < 20) %>%
  pull(age)
## [1] 17 18 18 17 15 18 16 15 17 14 17 18 15 16 14 17 15 17 18 19 13 16 17 15 19
## [26] 18 18 14 16 14 17 14 13 16 15 14 18 14 14 15 17 17 19 14 18 19 17 16 15 16
## [51] 15 18 16 16 14 18 17 13 17 14 14 15 17 18 17 14 16 19 19 18 17 15 16 16 18
## [76] 15 18 17 19 17 16 17 18 18 18 16 17 14 19 14 18 16 17 18 15 18 13 17 16 17
## [101] 17 17 13 16 18 18 18 19 17 13 16 14 19 13 15 16 14 18 13 19 15 15 16 13 13
## [126] 18 19 15 17 17 18 18 17 19 16 18 16 17 19 19 18 18 18 16 13 15 16 18 18 18
## [151] 15 19 19 15 15 16 17 19 18 14 14 19 17 14 13 17 19 17 19 19 14 18 16 19 16
## [176] 17 19 15 14 18 16 16 17 14 16 19 18 14 18 15 14 15 16 16 17 15 17 17 14 16
## [201] 15 16 13 15 15 18 14 19 17 18 18 17 18 18 16 17 16 15 18 19 17 16 18 19 18
## [226] 17 17 16 19 19 18 17 17 19 17 16 15 13 18 17 18 18 18 18 18 16 17 19 13 17
## [251] 18 17 18 15 19 16 18 16 16 16 14 15 18 19 18 19 19 19 19 13 16 17 17 17 18
## [276] 13 17 15 16 18 17 16 18 16 15 18 16 14 18 16 13 15 16 16 16 16 16 16 16 18
## [301] 19 16 19 13 14 15 15 14 15 18 18 15 16 17 19 15 16 18 13 17 18 15 16 17 17
## [326] 17 14 16 17 19 16 17 15 16 15 19 18 18 18 17 16 15 18 15 18 15 16 16 16 15
## [351] 15 13 15 16 15 17 19 18 16 14 19 18 17 18 18 18 15 18 17 19 19 18 19 19 17
## [376] 19 16 19 17 19 16 14 16 19 16 17 18 18 14 18 18 13 16 14 17 17 16 17 18 13
## [401] 13 16 18 17 19 15 15 18 18 18 17 17 16 13 18 19 16 18 18 17 18 17 17 19 18
## [426] 16 16 15 17 18 16 15 14 18 17 17 15 15 17 16 18 19 16 16 19 14 13 17 16 14
## [451] 18 13 15 17
You might wonder why we don’t just use select() to do the job. Here, we try to use select for the same purpose. Can you see what the problem is?
nerd_red %>%
  filter(age < 20) %>%
  select(age)
When using pull(), what we get back is a vector. When using select(), what we get back is a data frame with one column. However, if we only have a single column anyway, a vector is more comfortable to handle and more “lightweight”. (The vector is actually smaller than the data frame containing the same information - check out object.size(nerd_red["age"]) vs. object.size(nerd_red$age).) PS: As opposed to nerd_red[["age"]], which returns a vector, nerd_red["age"] gives back a one-column data frame.
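Here is a minimal sketch of the difference on a made-up data frame. The names toy, as_vector and as_frame are invented for this illustration; I only assume that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(age = c(17, 25, 18), gender = c(1, 2, 2))

as_vector <- toy %>% filter(age < 20) %>% pull(age)
as_frame  <- toy %>% filter(age < 20) %>% select(age)

is.vector(as_vector)     # pull() gives a plain numeric vector
is.data.frame(as_frame)  # select() gives a data frame with one column

# Functions that expect a vector work directly on the pull() result.
mean(as_vector)
```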
Another thing that is very useful is arrange() which lets you sort your data easily. For example, here is how you would sort the data by age:
nerd_red %>%
  arrange(age)
We can also arrange the data in descending order by wrapping the respective column in desc() (for “descending”).
nerd_red %>%
  arrange(desc(age))
Sorting with multiple columns is not a problem:
nerd_red %>%
  arrange(age, gender)
4.1) Use the code above, but switch the order of gender and age. What happens?
nerd_red %>%
  arrange(gender, age)
Before, we first sorted by age and then we sorted by gender within the age groups. Now it’s the other way around.
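With only four rows, the effect of the column order is easy to see. This is a small sketch on a made-up data frame (toy and the two result names are hypothetical, and I assume dplyr is installed):

```r
library(dplyr)

# Hypothetical toy data with ties in both columns.
toy <- data.frame(
  age    = c(20, 18, 20, 18),
  gender = c(2, 2, 1, 1)
)

# First key: age. Ties in age are broken by gender.
by_age_then_gender <- toy %>% arrange(age, gender)
by_age_then_gender

# First key: gender. Ties in gender are broken by age.
by_gender_then_age <- toy %>% arrange(gender, age)
by_gender_then_age
```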
4.2) Arrange nerd_red by voted and age. Let voted be ascending and age descending.
nerd_red %>%
  arrange(voted, desc(age))
We want to add some columns to the data. Most of the time, we want to calculate something like a sum score or re-code questions. The good news is that most of this follows the same logic as vector operations which we saw before.
In a rather silly example, we could calculate the birth year of each participant. We know that data has been collected from December 2015 - December 2018, but let’s assume for now that all this data is from 2018. So to get the participants’ birth year (roughly), we would need to calculate 2018 - participant age.
Whenever we want to add a new column (or modify one) within the tidyverse, we use the mutate() function. Here is how things would work for our toy example:
nerd_red %>%
  mutate(birthyear = 2018 - age, .before = gender)
I used the .before argument so the new column would be added before the column gender. Otherwise, the new column would have been added at the end of my data frame. This argument is a little bit strange, because its name starts with a dot, but don’t worry about this.⁶
Note that birthyear has not been saved to nerd_red right now:
names(nerd_red)
## [1] "age" "gender" "married" "voted" "nerdy" "Q1" "Q2"
## [8] "Q3" "Q4" "Q5" "Q6" "Q7" "Q8" "Q9"
## [15] "Q10" "Q11" "Q12" "Q13" "Q14" "Q15" "Q16"
## [22] "Q17" "Q18" "Q19" "Q20" "Q21" "Q22" "Q23"
## [29] "Q24" "Q25" "Q26"
In order to do that, we would explicitly need to overwrite nerd_red, which we won’t do here, because birthyear is a very useless column for us.
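The same behaviour can be reproduced on a tiny made-up data frame. In this sketch, toy and toy_with_year are hypothetical names, and I assume a dplyr version that supports the .before argument used above.

```r
library(dplyr)

# Hypothetical toy data.
toy <- data.frame(age = c(20, 35), gender = c(1, 2))

# Without assignment, the result is only printed; `toy` itself is untouched.
toy %>% mutate(birthyear = 2018 - age, .before = gender)
names(toy)

# Assigning the result to a name keeps the new column.
toy_with_year <- toy %>% mutate(birthyear = 2018 - age, .before = gender)
names(toy_with_year)
```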
Likewise, we can use functions on columns. For example, we could turn the age column into a character. That doesn’t make sense in this example, but the opposite case is often true, where you want to turn a column that is accidentally coded as a character into a numeric column.
nerd_red <-
  nerd_red %>%
  mutate(age_character = as.character(age), .after = age)
nerd_red
5.1) By overwriting nerd_red, I permanently added the age_character column to the data frame. Can you turn it into a numeric column again? You can replace it using the old column name (i.e. mutate(age_character = ...)). (Of course, the name will be pretty misleading afterwards.)
nerd_red <-
  nerd_red %>%
  mutate(age_character = as.numeric(age_character))
5.2) According to the code book, gender has been coded as 1 = Male, 2 = Female, 3 = Other. Assume there has been a mistake and it’s actually 0 = Male, 1 = Female, 2 = Other. How can you correct the mistake? (This is more of a “thinking exercise” than a coding exercise.) Don’t overwrite the data frame, because we want to keep the old (correct) gender column as it is.
nerd_red %>%
  mutate(gender = gender - 1)
Now we still have the useless age_character column in our data. If we want to get rid of it, we can do so by simply “deselecting” it:
nerd_red <- nerd_red %>%
  select(-age_character)
Alternative base R way:
# nerd_red$age_character <- NULL
In a lot of cases, we want to calculate the sum of columns. Say we want to calculate the sum of Q1 and Q2 in a new column called sum_Q1_Q2. This is how we would do that:
nerd_red %>%
  mutate(sum_Q1_Q2 = Q1 + Q2, .after = Q2)
We can see how the columns Q1 and Q2 add up per row in our new column. But what if we want to calculate the sum score across all the columns Q1 - Q26? Writing all of them down with a little plus sign seems tedious.
We can achieve this with a few twists. Let’s first consider the line with the mutate() function in the code chunk below. We can see that we create a column called nerd_score. We then use the function sum(), for obvious reasons. Next, we see the function across(). This is a so-called “helper function”: it is used inside of functions like mutate() to tell them which columns to work on. Basically, we need across() every time mutate() needs to use the same function on multiple columns at once. (Adding two columns together is a special case.) It actually matches the verbal description I used above: I want to calculate the sum score across columns Q1 - Q26. So far, so good. However, one crucial piece of code that we need is the additional rowwise() function before using mutate(). It does exactly what the name suggests: It causes all following functions to be executed separately on every single row of a data frame. That is, we calculate the sum for every row (= participant) separately. This might seem slightly confusing because we just saw that adding vectors automatically works row by row. Most functions work like that in R, but some don’t - one of them is sum(). It just sums up everything, so if we used it without rowwise(), we would get the entire sum of all the elements in all the columns we selected. Note that sum() is not the same as +.
As a last step, we ungroup() the data, because rowwise() groups the data into rows and we don’t want it to stay like that (otherwise, it would). Note that I overwrote nerd_red with the version of nerd_red that now has a column with the nerd sum score.
# use rowwise ...
# to calculate the nerd score ...
# as the sum ...
# across ...
# Q1 - Q26
# ungroup at the end
nerd_red <-
  nerd_red %>%
  rowwise() %>%
  mutate(nerd_score = sum(across(Q1:Q26)), .before = Q1) %>%
  ungroup()
I quickly want to give a shoutout to base R here because one of the ways you can solve the problem there is to use the function rowSums(), which does exactly what the name suggests. The clever part is to select the columns using the function paste0(). Take a look at the whole code first:
# base R way
nerd_red$nerd_score <-
  rowSums(nerd_red[ , paste0("Q", 1:26)])
Then, check out how paste0() (and her sister paste()) work.
paste0("Q", 1:5)
## [1] "Q1" "Q2" "Q3" "Q4" "Q5"
paste("Q", 1:5)
## [1] "Q 1" "Q 2" "Q 3" "Q 4" "Q 5"
Like many things in R, paste0() is also vectorised.
paste0(c("A", "B"), 1:4)
## [1] "A1" "B2" "A3" "B4"
So, selecting columns the base R way can be as elegant as this, which can be very handy later when you’re writing flexible functions:
head(
  nerd_red[ , paste0("Q", 1:26)]
)
Note that I do not even need to use the comma here (i.e. nerd_red[paste0("Q", 1:26)] works perfectly fine), but I like to explicitly leave the rows “blank” so it becomes obvious that I’m referring to all rows.
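To convince yourself that the tidyverse way and the base R way of calculating row sums agree, here is a small sketch on a made-up data frame. Note one deliberate swap: I use c_across(), dplyr’s companion to across() that is meant for use after rowwise(), instead of the across() call shown earlier. The names toy, tidy_scores, base_scores and grand_total are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with three "questions" and two participants.
toy <- data.frame(Q1 = c(1, 4), Q2 = c(2, 5), Q3 = c(3, 6))

# Tidyverse style: work row by row, then ungroup again.
tidy_scores <- toy %>%
  rowwise() %>%
  mutate(score = sum(c_across(Q1:Q3))) %>%
  ungroup() %>%
  pull(score)

# Base R style: rowSums() on the selected columns
# (unname() drops the row names that rowSums() attaches).
base_scores <- unname(rowSums(toy[ , paste0("Q", 1:3)]))

all.equal(tidy_scores, base_scores)

# Without any row-wise logic, sum() collapses everything into one grand total.
grand_total <- sum(toy[ , paste0("Q", 1:3)])
grand_total
```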
Sometimes, you want to modify existing columns in your data. For example, we could turn the numbers in the gender column into text, so it’s easier to interpret. We can see in the code book that 1 = Male, 2 = Female, 3 = Other. This is how we would recode the gender column:
# Note that we overwrite nerd_red here, so the column gender is permanently changed
nerd_red <-
  nerd_red %>%
  mutate(
    gender = case_when(
      gender == 1 ~ "male",
      gender == 2 ~ "female",
      gender == 3 ~ "other"
    )
  )
nerd_red
Let’s unpack this: case_when(), as the name suggests, does stuff when a certain case occurs. Whenever the gender column is 1, we insert “male”, when it is 2, we put in “female”, and when it is 3, we put in “other”. Note that it is important to specify all possible cases. If we only had put in values for the cases 1 and 2 (so, male and female), every occurrence of 3 would have been left blank (NA).
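The point about unspecified cases can be tried out on a plain vector. In this sketch, codes, incomplete and complete are made-up names; the catch-all line with TRUE is one common way to handle “everything else”, not something the exercise requires.

```r
library(dplyr)

# Hypothetical gender codes.
codes <- c(1, 2, 3, 2)

# Only two cases are spelled out, so the 3 matches nothing and becomes NA.
incomplete <- case_when(
  codes == 1 ~ "male",
  codes == 2 ~ "female"
)
incomplete

# A final TRUE condition catches everything that was not matched earlier.
complete <- case_when(
  codes == 1 ~ "male",
  codes == 2 ~ "female",
  TRUE       ~ "other"
)
complete
```

The conditions are checked from top to bottom, so the catch-all has to come last.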
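When we have only two cases, the function ifelse() is a useful shortcut. It takes three arguments: the condition to test, the value to use where the condition is TRUE, and the value to use where it is FALSE.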
For example, if we want to classify these numbers into “one digit” or “more than one digit”, it looks like this:
some_numbers <- c(2, 24, 12, 5, 1, 353, 1)
ifelse(some_numbers < 10, "one", "more than one")
## [1] "one"           "more than one" "more than one" "one"          
## [5] "one"           "more than one" "one"
We can also modify the data based on this. For example, we could multiply the data with -1 whenever it has more than one digit, and leave it as it is when it is one digit.
ifelse(some_numbers < 10, some_numbers, some_numbers * -1)
## [1]    2  -24  -12    5    1 -353    1
Note how we have to call some_numbers, even if we’re leaving things unchanged. You can imagine it like this: Suppose you’re creating a new vector, filling up the positions one by one. For every position, you contemplate whether you should put in the old number from some_numbers, or whether you should multiply it with -1 before adding it.
6.1) Recode the married column. It’s 1 = Never married, 2 = Currently married, 3 = Previously married. Overwrite nerd_red with the version that contains the recoded column.
nerd_red <-
  nerd_red %>%
  mutate(
    married = case_when(
      married == 1 ~ "never married",
      married == 2 ~ "currently married",
      married == 3 ~ "previously married"
    )
  )
6.2) Recode the voted column. It’s 1 = Yes, 2 = No. Use ifelse(). Overwrite nerd_red with the version that contains the recoded column.
nerd_red <-
  nerd_red %>%
  mutate(voted = ifelse(voted == 1, "yes", "no"))
6.3) Like ifelse(), case_when() can be used outside of mutate(). Can you use case_when() with the following vector so that you: 1) add 20 when the number is below 0 2) subtract 2 when the number is above 0 3) put in 9999 when the number is exactly 0
some_more_numbers <- c(12, 0, 0, -4, 0, -12, 5, 2)
case_when(
  some_more_numbers < 0 ~ some_more_numbers + 20,
  some_more_numbers > 0 ~ some_more_numbers - 2,
  some_more_numbers == 0 ~ 9999
)
## [1]   10 9999 9999   16 9999    8    3    0
One last thing. We see that every participant is a row in our data frame, but there is no participant id. This is in principle not a big issue for this data set, but for some data sets, it’s really important that everything stays in the correct order. Some operations on the data may change its order, and then it is crucial to be able to recover the original order of the data. We can add an id column to our data by numbering the rows.
The tidyverse has a built-in function for this, which goes like this:
nerd_red <-
  nerd_red %>%
  rowid_to_column(var = "id")
You could also number the rows like this:
nerd_red %>%
  mutate(id = 1:nrow(.), .before = age)
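To illustrate why such an id column is useful, here is a minimal sketch on a made-up data frame: we sort the data and then use the id to get the original order back. The names toy, with_id, shuffled and restored are hypothetical; rowid_to_column() comes from the tibble package, which is part of the tidyverse.

```r
library(dplyr)
library(tibble)  # rowid_to_column() lives here

# Hypothetical toy data in its "original" order.
toy <- data.frame(age = c(30, 18, 25))

with_id <- toy %>% rowid_to_column(var = "id")

# Sorting changes the row order ...
shuffled <- with_id %>% arrange(age)
shuffled

# ... but sorting by id brings the original order back.
restored <- shuffled %>% arrange(id)
identical(restored$age, toy$age)
```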
¹ Notice that I’m not using quotation marks here. However, both work: library(tidyverse) and library("tidyverse") do the same job.
² This thing is a pain in the a** to type by hand. I strongly recommend creating a shortcut. You can do this via Tools → Modify Keyboard Shortcuts. Look for “insert pipe operator” in the search bar. I chose “alt + .” to add the pipe, simply because “alt + -” inserts the assignment operator <- by default.
³ In fact, a brand new pipe operator has just been added to base R, which looks like this: |>. It works much like the tidyverse pipe, but not quite. You can’t exchange one for the other, at least not at the time of writing. The base R pipe has been officially available since 18.05.2021 (since version 4.1.0), so it will probably see some development during the next years.
⁴ Of course, it is bad to design exclusion criteria after looking at the data and ideally, the online form would have prevented people from entering implausible age values or excluded them right away.
⁵ The alternative is to wrap the code into (). Then, the output is printed even if I assign something to a variable. See the difference between example <- 2 + 3 and (example <- 2 + 3).
⁶ The reason is that inside the mutate() function, anything that looks like something = whatever would create a column of the name “something”. That means if we just wrote before = gender, mutate() would create a column called “before” that is a duplicate of “gender”.